Dataset statistics
| Number of variables | 10 |
|---|---|
| Number of observations | 89197 |
| Missing cells | 0 |
| Missing cells (%) | 0.0% |
| Duplicate rows | 0 |
| Duplicate rows (%) | 0.0% |
| Total size in memory | 6.8 MiB |
| Average record size in memory | 80.0 B |
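The memory figures in this table can be cross-checked with simple arithmetic. The sketch below is only a plausibility check: the 8 bytes per value and the fixed 128-byte index overhead are assumptions, not values stated in the report. They were chosen because they reproduce the reported total size, the average record size, and the 697.0 KiB memory size listed for each variable later in the report.

```python
# Figures taken from the dataset statistics table.
N_ROWS = 89197
N_COLS = 10

# Assumptions (not stated in the report): every column stores one 8-byte
# value per row, and the default integer index costs a fixed 128 bytes.
BYTES_PER_VALUE = 8
INDEX_BYTES = 128

# One column measured together with the index.
column_bytes = N_ROWS * BYTES_PER_VALUE + INDEX_BYTES
# All columns together, counting the index once.
total_bytes = N_ROWS * BYTES_PER_VALUE * N_COLS + INDEX_BYTES

column_kib = column_bytes / 1024
total_mib = total_bytes / 1024 ** 2
record_bytes = total_bytes / N_ROWS

print(f"per-column size: {column_kib:.1f} KiB")
print(f"total size:      {total_mib:.1f} MiB")
print(f"average record:  {record_bytes:.1f} B")
```

Under these assumptions the three printed values round to the figures shown in the report, which suggests the sizes are shallow sizes (string contents of the categorical columns are not counted).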
Variable types
| Numeric | 8 |
|---|---|
| Categorical | 2 |
category_id is highly correlated with video_id | High correlation |
video_id is highly correlated with category_id | High correlation |
category_id is highly correlated with video_id and 2 other fields | High correlation |
video_id is highly correlated with category_id and 2 other fields | High correlation |
age is highly correlated with profession | High correlation |
gender is highly correlated with engagement_score | High correlation |
profession is highly correlated with age | High correlation |
followers is highly correlated with category_id and 2 other fields | High correlation |
views is highly correlated with category_id and 2 other fields | High correlation |
engagement_score is highly correlated with gender | High correlation |
row_id is uniformly distributed | Uniform |
row_id has unique values | Unique |
Reproduction
| Analysis started | 2022-02-13 16:23:44.077367 |
|---|---|
| Analysis finished | 2022-02-13 16:23:58.306535 |
| Duration | 14.23 seconds |
| Software version | pandas-profiling v3.1.0 |
| Download configuration | config.json |
row_id
| Distinct | 89197 |
|---|---|
| Distinct (%) | 100.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 44599 |
| Minimum | 1 |
| Maximum | 89197 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 697.0 KiB |
Quantile statistics
| Minimum | 1 |
|---|---|
| 5-th percentile | 4460.8 |
| Q1 | 22300 |
| median | 44599 |
| Q3 | 66898 |
| 95-th percentile | 84737.2 |
| Maximum | 89197 |
| Range | 89196 |
| Interquartile range (IQR) | 44598 |
Descriptive statistics
| Standard deviation | 25749.10032 |
|---|---|
| Coefficient of variation (CV) | 0.5773470328 |
| Kurtosis | -1.2 |
| Mean | 44599 |
| Median Absolute Deviation (MAD) | 22299 |
| Skewness | 0 |
| Sum | 3978097003 |
| Variance | 663016167.2 |
| Monotonicity | Strictly increasing |
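Because row_id is flagged above as unique and uniformly distributed, and its minimum, maximum and distinct count show that it takes each integer from 1 to 89197 exactly once, most of these statistics follow from closed-form expressions. The sketch below recomputes them as a sanity check. It assumes the sample variance (divisor n - 1) and linear interpolation between order statistics for the quantiles, which is what the reported numbers are consistent with.

```python
import statistics

n = 89197  # number of observations; row_id takes the values 1..n exactly once

total = n * (n + 1) // 2        # sum of 1..n
mean = total / n
variance = n * (n + 1) / 12     # sample variance (divisor n - 1) of 1..n
std = variance ** 0.5
cv = std / mean                 # coefficient of variation


def quantile(q):
    """Quantile of 1..n using linear interpolation between order statistics."""
    return 1 + q * (n - 1)


median = statistics.median(range(1, n + 1))
# Median absolute deviation: median of the distances to the median.
mad = statistics.median(abs(x - median) for x in range(1, n + 1))

print(total, mean, round(variance, 1), round(std, 5), round(cv, 10))
print(quantile(0.05), quantile(0.25), quantile(0.75), quantile(0.95))
print(median, mad)
```

The kurtosis of about -1.2 and the skewness of 0 are likewise what one would expect for a uniform distribution.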
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) |
| 2047 | 1 | < 0.1% |
| 19100 | 1 | < 0.1% |
| 2708 | 1 | < 0.1% |
| 661 | 1 | < 0.1% |
| 6806 | 1 | < 0.1% |
| 4759 | 1 | < 0.1% |
| 27288 | 1 | < 0.1% |
| 25241 | 1 | < 0.1% |
| 31386 | 1 | < 0.1% |
| 29339 | 1 | < 0.1% |
| Other values (89187) | 89187 | > 99.9% |
| Value | Count | Frequency (%) |
| 1 | 1 | < 0.1% |
| 2 | 1 | < 0.1% |
| 3 | 1 | < 0.1% |
| 4 | 1 | < 0.1% |
| 5 | 1 | < 0.1% |
| 6 | 1 | < 0.1% |
| 7 | 1 | < 0.1% |
| 8 | 1 | < 0.1% |
| 9 | 1 | < 0.1% |
| 10 | 1 | < 0.1% |
| Value | Count | Frequency (%) |
| 89197 | 1 | < 0.1% |
| 89196 | 1 | < 0.1% |
| 89195 | 1 | < 0.1% |
| 89194 | 1 | < 0.1% |
| 89193 | 1 | < 0.1% |
| 89192 | 1 | < 0.1% |
| 89191 | 1 | < 0.1% |
| 89190 | 1 | < 0.1% |
| 89189 | 1 | < 0.1% |
| 89188 | 1 | < 0.1% |
user_id
Real number (ℝ≥0)
| Distinct | 27734 |
|---|---|
| Distinct (%) | 31.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 13881.90981 |
| Minimum | 1 |
| Maximum | 27734 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 697.0 KiB |
Quantile statistics
| Minimum | 1 |
|---|---|
| 5-th percentile | 1379 |
| Q1 | 6945 |
| median | 13892 |
| Q3 | 20819 |
| 95-th percentile | 26345 |
| Maximum | 27734 |
| Range | 27733 |
| Interquartile range (IQR) | 13874 |
Descriptive statistics
| Standard deviation | 8005.582771 |
|---|---|
| Coefficient of variation (CV) | 0.5766917436 |
| Kurtosis | -1.198788511 |
| Mean | 13881.90981 |
| Median Absolute Deviation (MAD) | 6939 |
| Skewness | -0.003594403317 |
| Sum | 1238224709 |
| Variance | 64089355.5 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) |
| 5198 | 10 | < 0.1% |
| 13218 | 10 | < 0.1% |
| 7157 | 10 | < 0.1% |
| 13410 | 10 | < 0.1% |
| 1448 | 10 | < 0.1% |
| 3759 | 9 | < 0.1% |
| 16046 | 9 | < 0.1% |
| 19970 | 9 | < 0.1% |
| 24051 | 9 | < 0.1% |
| 1691 | 9 | < 0.1% |
| Other values (27724) | 89102 | 99.9% |
| Value | Count | Frequency (%) |
| 1 | 3 | < 0.1% |
| 2 | 5 | < 0.1% |
| 3 | 4 | < 0.1% |
| 4 | 3 | < 0.1% |
| 5 | 4 | < 0.1% |
| 6 | 2 | < 0.1% |
| 7 | 2 | < 0.1% |
| 8 | 2 | < 0.1% |
| 9 | 2 | < 0.1% |
| 10 | 4 | < 0.1% |
| Value | Count | Frequency (%) |
| 27734 | 2 | < 0.1% |
| 27733 | 3 | < 0.1% |
| 27732 | 2 | < 0.1% |
| 27731 | 2 | < 0.1% |
| 27730 | 3 | < 0.1% |
| 27729 | 3 | < 0.1% |
| 27728 | 3 | < 0.1% |
| 27727 | 6 | < 0.1% |
| 27726 | 4 | < 0.1% |
| 27725 | 4 | < 0.1% |
category_id
| Distinct | 47 |
|---|---|
| Distinct (%) | 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 18.32373286 |
| Minimum | 1 |
| Maximum | 47 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 697.0 KiB |
Quantile statistics
| Minimum | 1 |
|---|---|
| 5-th percentile | 4 |
| Q1 | 8 |
| median | 16 |
| Q3 | 26 |
| 95-th percentile | 41 |
| Maximum | 47 |
| Range | 46 |
| Interquartile range (IQR) | 18 |
Descriptive statistics
| Standard deviation | 11.6751543 |
|---|---|
| Coefficient of variation (CV) | 0.6371602548 |
| Kurtosis | -0.8355649802 |
| Mean | 18.32373286 |
| Median Absolute Deviation (MAD) | 9 |
| Skewness | 0.4842759918 |
| Sum | 1634422 |
| Variance | 136.3092279 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=47)
| Value | Count | Frequency (%) |
| 5 | 8104 | 9.1% |
| 8 | 6313 | 7.1% |
| 25 | 4880 | 5.5% |
| 19 | 4679 | 5.2% |
| 21 | 4043 | 4.5% |
| 12 | 3889 | 4.4% |
| 11 | 3766 | 4.2% |
| 4 | 3678 | 4.1% |
| 34 | 3297 | 3.7% |
| 16 | 3264 | 3.7% |
| Other values (37) | 43284 | 48.5% |
| Value | Count | Frequency (%) |
| 1 | 1810 | 2.0% |
| 2 | 167 | 0.2% |
| 3 | 1845 | 2.1% |
| 4 | 3678 | 4.1% |
| 5 | 8104 | 9.1% |
| 6 | 1399 | 1.6% |
| 7 | 1885 | 2.1% |
| 8 | 6313 | 7.1% |
| 9 | 1886 | 2.1% |
| 10 | 1217 | 1.4% |
| Value | Count | Frequency (%) |
| 47 | 47 | 0.1% |
| 46 | 236 | 0.3% |
| 45 | 227 | 0.3% |
| 44 | 214 | 0.2% |
| 43 | 1019 | 1.1% |
| 42 | 2178 | 2.4% |
| 41 | 545 | 0.6% |
| 40 | 458 | 0.5% |
| 39 | 1339 | 1.5% |
| 38 | 534 | 0.6% |
video_id
| Distinct | 175 |
|---|---|
| Distinct (%) | 0.2% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 77.7153828 |
| Minimum | 1 |
| Maximum | 175 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 697.0 KiB |
Quantile statistics
| Minimum | 1 |
|---|---|
| 5-th percentile | 6 |
| Q1 | 34 |
| median | 76 |
| Q3 | 120 |
| 95-th percentile | 152 |
| Maximum | 175 |
| Range | 174 |
| Interquartile range (IQR) | 86 |
Descriptive statistics
| Standard deviation | 48.46965588 |
|---|---|
| Coefficient of variation (CV) | 0.6236816204 |
| Kurtosis | -1.258102332 |
| Mean | 77.7153828 |
| Median Absolute Deviation (MAD) | 42 |
| Skewness | 0.07962899431 |
| Sum | 6931979 |
| Variance | 2349.307541 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) |
| 112 | 1337 | 1.5% |
| 53 | 1334 | 1.5% |
| 1 | 1282 | 1.4% |
| 65 | 1103 | 1.2% |
| 42 | 1077 | 1.2% |
| 46 | 938 | 1.1% |
| 4 | 932 | 1.0% |
| 10 | 921 | 1.0% |
| 5 | 913 | 1.0% |
| 87 | 902 | 1.0% |
| Other values (165) | 78458 | 88.0% |
| Value | Count | Frequency (%) |
| 1 | 1282 | 1.4% |
| 2 | 622 | 0.7% |
| 3 | 167 | 0.2% |
| 4 | 932 | 1.0% |
| 5 | 913 | 1.0% |
| 6 | 732 | 0.8% |
| 7 | 716 | 0.8% |
| 8 | 716 | 0.8% |
| 9 | 890 | 1.0% |
| 10 | 921 | 1.0% |
| Value | Count | Frequency (%) |
| 175 | 47 | 0.1% |
| 174 | 70 | 0.1% |
| 173 | 80 | 0.1% |
| 172 | 154 | 0.2% |
| 171 | 78 | 0.1% |
| 170 | 78 | 0.1% |
| 169 | 154 | 0.2% |
| 168 | 160 | 0.2% |
| 167 | 227 | 0.3% |
| 166 | 66 | 0.1% |
age
| Distinct | 58 |
|---|---|
| Distinct (%) | 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 24.84861598 |
| Minimum | 10 |
| Maximum | 68 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 697.0 KiB |
Quantile statistics
| Minimum | 10 |
|---|---|
| 5-th percentile | 13 |
| Q1 | 18 |
| median | 23 |
| Q3 | 32 |
| 95-th percentile | 40 |
| Maximum | 68 |
| Range | 58 |
| Interquartile range (IQR) | 14 |
Descriptive statistics
| Standard deviation | 8.955535195 |
|---|---|
| Coefficient of variation (CV) | 0.3604037827 |
| Kurtosis | -0.2368663206 |
| Mean | 24.84861598 |
| Median Absolute Deviation (MAD) | 7 |
| Skewness | 0.5799597915 |
| Sum | 2216422 |
| Variance | 80.20161063 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) |
| 18 | 4870 | 5.5% |
| 19 | 4528 | 5.1% |
| 20 | 4399 | 4.9% |
| 17 | 4356 | 4.9% |
| 16 | 4014 | 4.5% |
| 15 | 3875 | 4.3% |
| 21 | 3722 | 4.2% |
| 22 | 3576 | 4.0% |
| 14 | 3086 | 3.5% |
| 23 | 2971 | 3.3% |
| Other values (48) | 49800 | 55.8% |
| Value | Count | Frequency (%) |
| 10 | 752 | 0.8% |
| 11 | 1171 | 1.3% |
| 12 | 1776 | 2.0% |
| 13 | 2588 | 2.9% |
| 14 | 3086 | 3.5% |
| 15 | 3875 | 4.3% |
| 16 | 4014 | 4.5% |
| 17 | 4356 | 4.9% |
| 18 | 4870 | 5.5% |
| 19 | 4528 | 5.1% |
| Value | Count | Frequency (%) |
| 68 | 6 | < 0.1% |
| 67 | 3 | < 0.1% |
| 66 | 5 | < 0.1% |
| 64 | 5 | < 0.1% |
| 63 | 8 | < 0.1% |
| 62 | 3 | < 0.1% |
| 61 | 3 | < 0.1% |
| 60 | 3 | < 0.1% |
| 59 | 4 | < 0.1% |
| 58 | 8 | < 0.1% |
gender
| Distinct | 2 |
|---|---|
| Distinct (%) | < 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 697.0 KiB |
Length
| Max length | 6 |
|---|---|
| Median length | 4 |
| Mean length | 4.825139859 |
| Min length | 4 |
Characters and Unicode
| Total characters | 0 |
|---|---|
| Distinct characters | 0 |
| Distinct categories | 0 |
| Distinct scripts | 0 |
| Distinct blocks | 0 |
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
Unique
| Unique | 0 |
|---|---|
| Unique (%) | 0.0% |
Sample
| 1st row | Male |
|---|---|
| 2nd row | Female |
| 3rd row | Male |
| 4th row | Male |
| 5th row | Male |
Common Values
| Value | Count | Frequency (%) |
| Male | 52397 | 58.7% |
| Female | 36800 | 41.3% |
Length
Histogram of lengths of the category
Pie chart
| Value | Count | Frequency (%) |
| male | 52397 | 58.7% |
| female | 36800 | 41.3% |
Most occurring characters
| Value | Count | Frequency (%) |
| No values found. | ||
Most occurring categories
| Value | Count | Frequency (%) |
| No values found. | ||
Most frequent character per category
Most occurring scripts
| Value | Count | Frequency (%) |
| No values found. | ||
Most frequent character per script
Most occurring blocks
| Value | Count | Frequency (%) |
| No values found. | ||
Most frequent character per block
profession
| Distinct | 3 |
|---|---|
| Distinct (%) | < 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Memory size | 697.0 KiB |
Length
| Max length | 20 |
|---|---|
| Median length | 7 |
| Mean length | 8.980638362 |
| Min length | 5 |
Characters and Unicode
| Total characters | 0 |
|---|---|
| Distinct characters | 0 |
| Distinct categories | 0 |
| Distinct scripts | 0 |
| Distinct blocks | 0 |
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
Unique
| Unique | 0 |
|---|---|
| Unique (%) | 0.0% |
Sample
| 1st row | Student |
|---|---|
| 2nd row | Student |
| 3rd row | Student |
| 4th row | Student |
| 5th row | Working Professional |
Common Values
| Value | Count | Frequency (%) |
| Student | 44638 | 50.0% |
| Other | 26840 | 30.1% |
| Working Professional | 17719 | 19.9% |
Length
Histogram of lengths of the category
Pie chart
| Value | Count | Frequency (%) |
| student | 44638 | 41.8% |
| other | 26840 | 25.1% |
| professional | 17719 | 16.6% |
| working | 17719 | 16.6% |
Most occurring characters
| Value | Count | Frequency (%) |
| No values found. | ||
Most occurring categories
| Value | Count | Frequency (%) |
| No values found. | ||
Most frequent character per category
Most occurring scripts
| Value | Count | Frequency (%) |
| No values found. | ||
Most frequent character per script
Most occurring blocks
| Value | Count | Frequency (%) |
| No values found. | ||
Most frequent character per block
followers
| Distinct | 17 |
|---|---|
| Distinct (%) | < 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 252.4601724 |
| Minimum | 160 |
| Maximum | 360 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 697.0 KiB |
Quantile statistics
| Minimum | 160 |
|---|---|
| 5-th percentile | 180 |
| Q1 | 230 |
| median | 240 |
| Q3 | 280 |
| 95-th percentile | 340 |
| Maximum | 360 |
| Range | 200 |
| Interquartile range (IQR) | 50 |
Descriptive statistics
| Standard deviation | 46.09446804 |
|---|---|
| Coefficient of variation (CV) | 0.1825811477 |
| Kurtosis | -0.2255999941 |
| Mean | 252.4601724 |
| Median Absolute Deviation (MAD) | 30 |
| Skewness | 0.4141641082 |
| Sum | 22518690 |
| Variance | 2124.699984 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=17)
| Value | Count | Frequency (%) |
| 230 | 16477 | 18.5% |
| 240 | 14767 | 16.6% |
| 280 | 7559 | 8.5% |
| 180 | 7092 | 8.0% |
| 270 | 6965 | 7.8% |
| 250 | 5533 | 6.2% |
| 320 | 5146 | 5.8% |
| 340 | 4941 | 5.5% |
| 210 | 4038 | 4.5% |
| 260 | 3340 | 3.7% |
| Other values (7) | 13339 | 15.0% |
| Value | Count | Frequency (%) |
| 160 | 1885 | 2.1% |
| 180 | 7092 | 8.0% |
| 190 | 236 | 0.3% |
| 200 | 1680 | 1.9% |
| 210 | 4038 | 4.5% |
| 220 | 2838 | 3.2% |
| 230 | 16477 | 18.5% |
| 240 | 14767 | 16.6% |
| 250 | 5533 | 6.2% |
| 260 | 3340 | 3.7% |
| Value | Count | Frequency (%) |
| 360 | 1810 | 2.0% |
| 340 | 4941 | 5.5% |
| 330 | 2712 | 3.0% |
| 320 | 5146 | 5.8% |
| 290 | 2178 | 2.4% |
| 280 | 7559 | 8.5% |
| 270 | 6965 | 7.8% |
| 260 | 3340 | 3.7% |
| 250 | 5533 | 6.2% |
| 240 | 14767 | 16.6% |
views
| Distinct | 43 |
|---|---|
| Distinct (%) | < 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 502.9802684 |
| Minimum | 30 |
| Maximum | 1000 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 697.0 KiB |
Quantile statistics
| Minimum | 30 |
|---|---|
| 5-th percentile | 134 |
| Q1 | 229 |
| median | 467 |
| Q3 | 714 |
| 95-th percentile | 909 |
| Maximum | 1000 |
| Range | 970 |
| Interquartile range (IQR) | 485 |
Descriptive statistics
| Standard deviation | 268.5694818 |
|---|---|
| Coefficient of variation (CV) | 0.5339562975 |
| Kurtosis | -1.207079747 |
| Mean | 502.9802684 |
| Median Absolute Deviation (MAD) | 238 |
| Skewness | 0.04366171291 |
| Sum | 44864331 |
| Variance | 72129.56656 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=43)
| Value | Count | Frequency (%) |
| 628 | 9090 | 10.2% |
| 229 | 8104 | 9.1% |
| 317 | 4880 | 5.5% |
| 369 | 4679 | 5.2% |
| 909 | 4043 | 4.5% |
| 138 | 3889 | 4.4% |
| 180 | 3766 | 4.2% |
| 781 | 3678 | 4.1% |
| 840 | 3533 | 4.0% |
| 462 | 3264 | 3.7% |
| Other values (33) | 40271 | 45.1% |
| Value | Count | Frequency (%) |
| 30 | 167 | 0.2% |
| 44 | 1217 | 1.4% |
| 52 | 916 | 1.0% |
| 72 | 545 | 0.6% |
| 89 | 1339 | 1.5% |
| 95 | 227 | 0.3% |
| 134 | 986 | 1.1% |
| 138 | 3889 | 4.4% |
| 156 | 1855 | 2.1% |
| 178 | 1025 | 1.1% |
| Value | Count | Frequency (%) |
| 1000 | 962 | 1.1% |
| 990 | 1810 | 2.0% |
| 909 | 4043 | 4.5% |
| 900 | 458 | 0.5% |
| 892 | 702 | 0.8% |
| 884 | 1208 | 1.4% |
| 862 | 1065 | 1.2% |
| 840 | 3533 | 4.0% |
| 819 | 1886 | 2.1% |
| 806 | 534 | 0.6% |
engagement_score
| Distinct | 229 |
|---|---|
| Distinct (%) | 0.3% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 3.487797011 |
| Minimum | 0 |
| Maximum | 5 |
| Zeros | 198 |
| Zeros (%) | 0.2% |
| Negative | 0 |
| Negative (%) | 0.0% |
| Memory size | 697.0 KiB |
Quantile statistics
| Minimum | 0 |
|---|---|
| 5-th percentile | 1.96 |
| Q1 | 2.9 |
| median | 3.71 |
| Q3 | 4.15 |
| 95-th percentile | 4.63 |
| Maximum | 5 |
| Range | 5 |
| Interquartile range (IQR) | 1.25 |
Descriptive statistics
| Standard deviation | 0.8634980456 |
|---|---|
| Coefficient of variation (CV) | 0.2475769211 |
| Kurtosis | 0.6405838743 |
| Mean | 3.487797011 |
| Median Absolute Deviation (MAD) | 0.55 |
| Skewness | -0.8556303385 |
| Sum | 311101.03 |
| Variance | 0.7456288748 |
| Monotonicity | Not monotonic |
Histogram with fixed size bins (bins=50)
| Value | Count | Frequency (%) |
| 3.8 | 2172 | 2.4% |
| 3.77 | 2041 | 2.3% |
| 3.87 | 1894 | 2.1% |
| 2.7 | 1882 | 2.1% |
| 4.29 | 1827 | 2.0% |
| 3.83 | 1770 | 2.0% |
| 2.48 | 1702 | 1.9% |
| 2.6 | 1696 | 1.9% |
| 3.4 | 1566 | 1.8% |
| 4.31 | 1565 | 1.8% |
| Other values (219) | 71082 | 79.7% |
| Value | Count | Frequency (%) |
| 0 | 198 | 0.2% |
| 0.02 | 28 | < 0.1% |
| 0.4 | 189 | 0.2% |
| 0.42 | 28 | < 0.1% |
| 0.45 | 94 | 0.1% |
| 0.69 | 166 | 0.2% |
| 0.86 | 117 | 0.1% |
| 1.09 | 272 | 0.3% |
| 1.1 | 180 | 0.2% |
| 1.15 | 174 | 0.2% |
| Value | Count | Frequency (%) |
| 5 | 196 | 0.2% |
| 4.98 | 26 | < 0.1% |
| 4.97 | 52 | 0.1% |
| 4.96 | 40 | < 0.1% |
| 4.95 | 53 | 0.1% |
| 4.94 | 75 | 0.1% |
| 4.93 | 95 | 0.1% |
| 4.92 | 149 | 0.2% |
| 4.91 | 237 | 0.3% |
| 4.9 | 271 | 0.3% |
Spearman's ρ
Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation. To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
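As an illustration of the definition above, here is a minimal pure-Python sketch that ranks both variables (tied values share the average of their ranks) and then applies the covariance-over-standard-deviations formula to the ranks. The function names and the toy data are hypothetical and are not taken from the profiling library.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the rank variables."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: y = x**2 is monotonic in x but not linear.
x = [1, 2, 3, 4, 5]
y = [v ** 2 for v in x]
print(spearman(x, y))  # 1 up to floating point rounding
print(pearson(x, y))   # below 1, since the relationship is not linear
```

The toy data shows why ρ is preferred for nonlinear monotonic relationships: the ranks of x and y agree perfectly even though the raw values do not lie on a line.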
Pearson's r
Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r. To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
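The formula and the invariance property can be checked with a short sketch. The function name and the toy data below are hypothetical. The shift and scale factors are arbitrary positive numbers chosen for illustration (a negative scale factor on one variable would flip the sign of r).

```python
def pearson(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # The 1/n (or 1/(n-1)) factors cancel between numerator and denominator,
    # so plain sums of products are enough.
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


# Hypothetical toy data with a roughly linear relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson(x, y)

# Shift and rescale each variable separately; r should be unchanged.
x_transformed = [2.0 * v + 3.0 for v in x]
y_transformed = [0.5 * v - 1.0 for v in y]
r_transformed = pearson(x_transformed, y_transformed)

print(r, r_transformed)
```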
Kendall's τ
Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation. To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
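The pair-counting definition above translates directly into code. The sketch below implements exactly that formula, which is usually called tau-a. It is a hypothetical helper for illustration: statistical libraries often apply a tie correction (tau-b) instead, so their results can differ when the data contain ties.

```python
from itertools import combinations


def kendall_tau(x, y):
    """(concordant - discordant) / total pairs, i.e. tau-a with no tie correction."""
    concordant = discordant = 0
    pairs = list(combinations(range(len(x)), 2))
    for i, j in pairs:
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1   # both variables move in the same direction
        elif direction < 0:
            discordant += 1   # the variables move in opposite directions
        # direction == 0 is a tie in x or y; it only counts toward the total
    return (concordant - discordant) / len(pairs)


# Hypothetical toy data: one swapped pair out of six.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
print(kendall_tau(x, y))  # 5 concordant and 1 discordant pair out of 6
```

The double loop is quadratic in the number of observations, so it is meant for small examples rather than for a dataset of this size.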
Cramér's V (φc)
Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. We use a bias-corrected measure proposed by Bergsma in 2013.
Phik (φk)
Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. Extensive documentation is available for it.
A simple visualization of nullity by column.
The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion by eye.
First rows
| | row_id | user_id | category_id | video_id | age | gender | profession | followers | views | engagement_score |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 19990 | 37 | 128 | 24 | Male | Student | 180 | 1000 | 4.33 |
| 1 | 2 | 5304 | 32 | 132 | 14 | Female | Student | 330 | 714 | 1.79 |
| 2 | 3 | 1840 | 12 | 24 | 19 | Male | Student | 180 | 138 | 4.35 |
| 3 | 4 | 12597 | 23 | 112 | 19 | Male | Student | 220 | 613 | 3.77 |
| 4 | 5 | 13626 | 23 | 112 | 27 | Male | Working Professional | 220 | 613 | 3.13 |
| 5 | 6 | 9323 | 25 | 139 | 35 | Male | Other | 240 | 317 | 3.33 |
| 6 | 7 | 2071 | 7 | 14 | 23 | Male | Student | 160 | 467 | 3.80 |
| 7 | 8 | 21848 | 8 | 100 | 18 | Male | Student | 280 | 628 | 3.87 |
| 8 | 9 | 12896 | 3 | 4 | 15 | Male | Student | 270 | 621 | 2.88 |
| 9 | 10 | 16058 | 5 | 161 | 19 | Male | Student | 240 | 229 | 3.80 |
Last rows
| | row_id | user_id | category_id | video_id | age | gender | profession | followers | views | engagement_score |
|---|---|---|---|---|---|---|---|---|---|---|
| 89187 | 89188 | 23693 | 11 | 77 | 18 | Male | Working Professional | 250 | 180 | 3.65 |
| 89188 | 89189 | 10412 | 12 | 42 | 18 | Female | Student | 180 | 138 | 4.23 |
| 89189 | 89190 | 7203 | 6 | 11 | 20 | Male | Student | 210 | 362 | 4.44 |
| 89190 | 89191 | 19181 | 34 | 114 | 14 | Female | Student | 230 | 840 | 2.60 |
| 89191 | 89192 | 23102 | 37 | 128 | 41 | Male | Working Professional | 180 | 1000 | 3.19 |
| 89192 | 89193 | 23996 | 15 | 32 | 25 | Male | Other | 340 | 662 | 3.91 |
| 89193 | 89194 | 20466 | 20 | 47 | 31 | Male | Other | 240 | 892 | 3.56 |
| 89194 | 89195 | 13655 | 16 | 97 | 25 | Male | Student | 270 | 462 | 4.23 |
| 89195 | 89196 | 24840 | 9 | 18 | 35 | Male | Working Professional | 230 | 819 | 3.77 |
| 89196 | 89197 | 27183 | 25 | 150 | 13 | Male | Student | 240 | 317 | 4.31 |